Pesquisa | Portal Regional da BVS

1.

A noise audit of human-labeled benchmarks for machine commonsense reasoning.

Kejriwal, Mayank; Santos, Henrique; Shen, Ke; Mulvehill, Alice M; McGuinness, Deborah L.

Sci Rep ; 14(1): 8609, 2024 04 14.

Artigo em Inglês | MEDLINE | ID: mdl-38615039

RESUMO

With the advent of large language models, evaluating and benchmarking these systems on important AI problems has taken on newfound importance. Such benchmarking typically involves comparing the predictions of a system against human labels (or a single 'ground-truth'). However, much recent work in psychology has suggested that most tasks involving significant human judgment can have non-trivial degrees of noise. In his book, Kahneman suggests that noise may be a much more significant component of inaccuracy compared to bias, which has been studied more extensively in the AI community. This article proposes a detailed noise audit of human-labeled benchmarks in machine commonsense reasoning, an important current area of AI research. We conduct noise audits under two important experimental conditions: one in a smaller-scale but higher-quality labeling setting, and another in a larger-scale, more realistic online crowdsourced setting. Using Kahneman's framework of noise, our results consistently show non-trivial amounts of level, pattern, and system noise, even in the higher-quality setting, with comparable results in the crowdsourced setting. We find that noise can significantly influence the performance estimates that we obtain of commonsense reasoning systems, even if the 'system' is a human; in some cases, by almost 10 percent. Labeling noise also affects performance estimates of systems like ChatGPT by more than 4 percent. Our results suggest that the default practice in the AI community of assuming and using a 'single' ground-truth, even on problems requiring seemingly straightforward human judgment, may warrant empirical and methodological re-visiting.

Assuntos

Benchmarking , Resolução de Problemas , Humanos , Julgamento , Livros , Idioma

2.

Quantifying confidence shifts in a BERT-based question answering system evaluated on perturbed instances.

Shen, Ke; Kejriwal, Mayank.

PLoS One ; 18(12): e0295925, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-38117790

RESUMO

Recent work on transformer-based neural networks has led to impressive advances on multiple-choice natural language processing (NLP) problems, such as Question Answering (QA) and abductive reasoning. Despite these advances, there is limited work still on systematically evaluating such models in ambiguous situations where (for example) no correct answer exists for a given prompt among the provided set of choices. Such ambiguous situations are not infrequent in real world applications. We design and conduct an experimental study of this phenomenon using three probes that aim to 'confuse' the model by perturbing QA instances in a consistent and well-defined manner. Using a detailed set of results based on an established transformer-based multiple-choice QA system on two established benchmark datasets, we show that the model's confidence in its results is very different from that of an expected model that is 'agnostic' to all choices that are incorrect. Our results suggest that high performance on idealized QA instances should not be used to infer or extrapolate similarly high performance on more ambiguous instances. Auxiliary results suggest that the model may not be able to distinguish between these two situations with sufficient certainty. Stronger testing protocols and benchmarking may hence be necessary before such models are deployed in front-facing systems or ambiguous decision making with significant human impact.

Assuntos

Armazenamento e Recuperação da Informação , Redes Neurais de Computação , Humanos , Processamento de Linguagem Natural

3.

TG-CSR: A human-labeled dataset grounded in nine formal commonsense categories.

Santos, Henrique; Mulvehill, Alice M; Shen, Ke; Kejriwal, Mayank; McGuinness, Deborah L.

Data Brief ; 51: 109666, 2023 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-37876745

RESUMO

Machine Common Sense Reasoning is the subfield of Artificial Intelligence that aims to enable machines to behave or make decisions similarly to humans in everyday and ordinary situations. To measure progress, benchmarks in the form of question-answering datasets have been developed and published in the community to evaluate machine commonsense models, including large language models. We describe the individual label data produced by six human annotators originally used in computing ground truth for the Theoretically-Grounded Commonsense Reasoning (TG-CSR) benchmark's composing datasets. According to a set of instructions, annotators were provided with spreadsheets containing the original TG-CSR prompts and asked to insert labels in specific spreadsheet cells during annotation sessions. TG-CSR data is organized in JSON files, individual raw label data in a spreadsheet file, and individual normalized label data in JSONL files. The release of individual labels can enable the analysis of the labeling process itself, including studies of noise and consistency across annotators.

4.

Quantifying COVID-19 policy impacts on subjective well-being during the early phase of the pandemic: A cross-sectional analysis of United States survey data from March to August 2020.

Shen, Ke; Kejriwal, Mayank.

PLoS One ; 18(9): e0291494, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-37733714

RESUMO

To stop the spread of COVID-19, a number of public health policies and restrictions were implemented during the pre-vaccination phase of the pandemic. This study provides a quantitative assessment of how these policies impacted subjective well-being (SWB) in the United States over a 6-month period spanning March to August 2020. We study two specific research objectives. First, we aim to quantify the impacts of COVID-19 public health policies at different levels of stringency on SWB. Second, we train and implement a conditional inference tree model for predicting individual SWB based both on socio-demographic characteristics and policies then in place. Our results indicate that policies such as enforcing strict stay-at-home requirements and closing workplaces were negatively associated with SWB, and that an individual's socio-demographic characteristics, including income status, job, and gender, conditionally interact with policies such as workplace closure in a predictive model of SWB. Therefore, although such policies may have positive health implications, they also have secondary environmental and social implications that need to be taken into account in any cost-benefit analysis of such policies for future pandemic preparedness. Our proposed methodology suggests a way to quantify such impacts through the lens of SWB, and to further advance the science of pandemic preparedness from a public health perspective.

Assuntos

COVID-19 , Pandemias , Humanos , Estudos Transversais , Pandemias/prevenção & controle , COVID-19/epidemiologia , COVID-19/prevenção & controle , Política Pública , Análise Custo-Benefício

5.

Using conditional inference to quantify interaction effects of socio-demographic covariates of US COVID-19 vaccine hesitancy.

Shen, Ke; Kejriwal, Mayank.

PLOS Glob Public Health ; 3(5): e0001151, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-37172006

RESUMO

COVID-19 vaccine hesitancy has become a major issue in the U.S. as vaccine supply has outstripped demand and vaccination rates slow down. At least one recent global survey has sought to study the covariates of vaccine acceptance, but an inferential model that makes simultaneous use of several socio-demographic variables has been lacking. This study has two objectives. First, we quantify the associations between common socio-demographic variables (including, but not limited to, age, ethnicity, and income) and vaccine acceptance in the U.S. Second, we use a conditional inference tree to quantify and visualize the interaction and conditional effects of relevant socio-demographic variables, known to be important correlates of vaccine acceptance in the U.S., on vaccine acceptance. We conduct a retrospective analysis on a COVID-19 cross-sectional Gallup survey data administered to a representative sample of U.S.-based respondents. Our univariate regression results indicate that most socio-demographic variables, such as age, education, level of household income and education, have significant association with vaccine acceptance, although there are key points of disagreement with the global survey. Similarly, our conditional inference tree model shows that trust in the (former) Trump administration, age and ethnicity are the most important covariates for predicting vaccine hesitancy. Our model also highlights the interdependencies between these variables using a tree-like visualization.

6.

Can language representation models think in bets?

Tang, Zhisheng; Kejriwal, Mayank.

R Soc Open Sci ; 10(3): 221585, 2023 Mar.

Artigo em Inglês | MEDLINE | ID: mdl-36998768

RESUMO

In recent years, transformer-based language representation models (LRMs) have achieved state-of-the-art results on difficult natural language understanding problems, such as question answering and text summarization. As these models are integrated into real-world applications, evaluating their ability to make rational decisions is an important research agenda, with practical ramifications. This article investigates LRMs' rational decision-making ability through a carefully designed set of decision-making benchmarks and experiments. Inspired by classic work in cognitive science, we model the decision-making problem as a bet. We then investigate an LRM's ability to choose outcomes that have optimal, or at minimum, positive expected gain. Through a robust body of experiments on four established LRMs, we show that a model is able to 'think in bets' if it is first fine-tuned on bet questions with an identical structure. Modifying the bet question's structure, while still retaining its fundamental characteristics, decreases an LRM's performance by more than 25%, on average, although absolute performance remains well above random. LRMs are also found to be more rational when selecting outcomes with non-negative expected gain, rather than optimal or strictly positive expected gain. Our results suggest that LRMs could potentially be applied to tasks that rely on cognitive decision-making skills, but that more research is necessary before these models can robustly make rational decisions.

7.

Dataset for studying gender disparity in English literary texts.

Nagaraj, Akarsh; Kejriwal, Mayank.

Data Brief ; 41: 107905, 2022 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-35198684

RESUMO

Recent discourse has highlighted significant gender disparity in many aspects of economic, social and cultural life. With the advent of advanced tools in Artificial Intelligence (AI) and Natural Language Processing (NLP), there is an opportunity to use computational and digital tools to analyze corpora, such as copyright-expired literature in the pre-modern period (defined herein as books published approximately between 1800 and 1950) from the Project Gutenberg corpus. Nevertheless, there are challenges in using such tools, especially for maintaining high-enough quality to explore interesting hypotheses. We present a dataset and materials that illustrate how modern processes in NLP can be used on the raw text of more than 3,000 literary texts in Project Gutenberg to (i) extract characters and pronouns from the text with high quality, (ii) disambiguate characters so that they are not overcounted, (iii) detect the gender of each character. Furthermore, we also used manual labeling to determine the genders of authors who have published these texts, and published the labels as part of the dataset to facilitate future digital humanities research.

8.

Predicting zip code-level vaccine hesitancy in US Metropolitan Areas using machine learning models on public tweets.

Melotte, Sara; Kejriwal, Mayank.

PLOS Digit Health ; 1(4): e0000021, 2022 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-36812517

RESUMO

Although the recent rise and uptake of COVID-19 vaccines in the United States has been encouraging, there continues to be significant vaccine hesitancy in various geographic and demographic clusters of the adult population. Surveys, such as the one conducted by Gallup over the past year, can be useful in determining vaccine hesitancy, but can be expensive to conduct and do not provide real-time data. At the same time, the advent of social media suggests that it may be possible to get vaccine hesitancy signals at an aggregate level, such as at the level of zip codes. Theoretically, machine learning models can be learned using socioeconomic (and other) features from publicly available sources. Experimentally, it remains an open question whether such an endeavor is feasible, and how it would compare to non-adaptive baselines. In this article, we present a proper methodology and experimental study for addressing this question. We use publicly available Twitter data collected over the previous year. Our goal is not to devise novel machine learning algorithms, but to rigorously evaluate and compare established models. Here we show that the best models significantly outperform non-learning baselines. They can also be set up using open-source tools and software.

9.

Measuring spatio-textual affinities in twitter between two urban metropolises.

Hu, Minda; Kejriwal, Mayank.

J Comput Soc Sci ; 5(1): 227-252, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-34095601

RESUMO

With increasing growth of both social media and urbanization, studying urban life through the empirical lens of social media has led to some interesting research opportunities and questions. It is well-recognized that as a 'social animal', most humans are deeply embedded both in their cultural milieu and in broader society that extends well beyond close family, including neighborhoods, communities and workplaces. In this article, we study this embeddedness by leveraging urban dwellers' social media footprint. Specifically, we define and empirically study the issue of spatio-textual affinity by collecting many millions of geotagged tweets collected from two diverse metropolises within the United States: the Boroughs of New York City, and the County of Los Angeles. Spatio-textual affinity is the intuitive hypothesis that tweets coming from similar locations (spatial affinity) will tend to be topically similar (textual affinity). This simple definition of the problem belies the complexity of measuring it, since (re-tweets notwithstanding) two tweets are never truly identical either spatially or textually. Workable definitions of affinity along both dimensions are required, as are appropriate experimental designs, visualizations and measurements. In addition to providing such definitions and a viable framework for conducting spatio-textual affinity experiments on Twitter data, we provide detailed results illustrating how our framework can be used to compare and contrast two important metropolitan areas from multiple perspectives and granularities.

10.

Link Prediction Between Structured Geopolitical Events: Models and Experiments.

Kejriwal, Mayank.

Front Big Data ; 4: 779792, 2021.

Artigo em Inglês | MEDLINE | ID: mdl-34917934

RESUMO

Often thought of as higher-order entities, events have recently become important subjects of research in the computational sciences, including within complex systems and natural language processing (NLP). One such application is event link prediction. Given an input event, event link prediction is the problem of retrieving a relevant set of events, similar to the problem of retrieving relevant documents on the Web in response to keyword queries. Since geopolitical events have complex semantics, it is an open question as to how to best model and represent events within the framework of event link prediction. In this paper, we formalize the problem and discuss how established representation learning algorithms from the machine learning community could potentially be applied to it. We then conduct a detailed empirical study on the Global Terrorism Database (GTD) using a set of metrics inspired by the information retrieval community. Our results show that, while there is considerable signal in both network-theoretic and text-centric models of the problem, classic text-only models such as bag-of-words prove surprisingly difficult to outperform. Our results establish both a baseline for event link prediction on GTD, and currently outstanding challenges for the research community to tackle in this space.

11.

Reclaiming my name.

Kejriwal, Mayank.

Science ; 372(6547): 1238, 2021 06 11.

Artigo em Inglês | MEDLINE | ID: mdl-34112696

12.

On using centrality to understand importance of entities in the Panama Papers.

Kejriwal, Mayank.

PLoS One ; 16(3): e0248573, 2021.

Artigo em Inglês | MEDLINE | ID: mdl-33765027

RESUMO

The Panama Papers comprise one of the most recent influential leaks containing detailed information on intermediary companies (such as law firms), offshore entities and company officers, and serve as a valuable source of insight into the operations of (approximately) 214,000 shell companies incorporated in tax havens around the globe over the past half century. Entities and relations in the papers can be used to construct a network that permits, in principle, a systematic and scientific study at scale using techniques developed in the computational social science and network science communities. In this paper, we propose such a study by attempting to quantify and profile the importance of entities. In particular, our research explores whether intermediaries are significantly more influential than offshore entities, and whether different centrality measures lead to varying, or even incompatible, conclusions. Some findings yield conclusions that resemble Simpson's paradox. We also explore the role that jurisdictions play in determining entity importance.

Assuntos

Comércio/legislação & jurisprudência , Humanos

13.

On detecting urgency in short crisis messages using minimal supervision and transfer learning.

Kejriwal, Mayank; Zhou, Peilin.

Soc Netw Anal Min ; 10(1): 58, 2020.

Artigo em Inglês | MEDLINE | ID: mdl-32834866

RESUMO

Humanitarian disasters have been on the rise in recent years due to the effects of climate change and socio-political situations such as the refugee crisis. Technology can be used to best mobilize resources such as food and water in the event of a natural disaster, by semi-automatically flagging tweets and short messages as indicating an urgent need. The problem is challenging not just because of the sparseness of data in the immediate aftermath of a disaster, but because of the varying characteristics of disasters in developing countries (making it difficult to train just one system) and the noise and quirks in social media. In this paper, we present a robust, low-supervision social media urgency system that adapts to arbitrary crises by leveraging both labeled and unlabeled data in an ensemble setting. The system is also able to adapt to new crises where an unlabeled background corpus may not be available yet by utilizing a simple and effective transfer learning methodology. Experimentally, our transfer learning and low-supervision approaches are found to outperform viable baselines with high significance on myriad disaster datasets.

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA